revert pinned hostmem for pod comm by AtlantaPepsi · Pull Request #292 · ROCm/TransferBench

AtlantaPepsi · 2026-05-08T20:50:15Z

Motivation

This PR restores the option of using extended GPU memory in cross-pod transfers.

Technical Details

Removed the CheckPages() call inside AllocateMemory() for pinned host memory. That check could fail silently on the node where the memory is allocated, leaving the other nodes hanging at the broadcast inside ExchangeMemory.

This change has not yet been tested on either NVIDIA or AMD platforms.
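To make the failure mode above concrete, here is a toy model of why a local check that fails on one rank before a collective can leave the remaining ranks stuck. This is not TransferBench code: the function names, the `checkPasses` input, and the idea of counting arrivals are illustrative assumptions, and a real broadcast blocks instead of returning a count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (hypothetical): every rank runs a local page check and then
// enters a broadcast. A rank whose check fails returns early and never
// reaches the broadcast. Returns how many ranks arrive at the broadcast.
std::size_t RanksReachingBroadcast(const std::vector<bool>& checkPasses,
                                   bool checkEnabled) {
  std::size_t arrived = 0;
  for (bool passes : checkPasses) {
    if (checkEnabled && !passes) {
      continue;  // early exit on this rank only; the others are not told
    }
    ++arrived;
  }
  return arrived;
}

// A collective can only complete when every rank participates.
bool BroadcastCompletes(const std::vector<bool>& checkPasses,
                        bool checkEnabled) {
  return RanksReachingBroadcast(checkPasses, checkEnabled) ==
         checkPasses.size();
}
```

In this model, four ranks with one failing check give only three arrivals when the check is enabled, so the broadcast never completes. With the check disabled, all four arrive, at the cost of no longer validating page placement.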

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR adjusts pod-communication (fabric-handle) memory allocation to better support exporting extended (HOST_NUMA) memory across pods, and avoids a failure mode where NUMA page validation caused one rank to error while other ranks hung in collectives.

Changes:

Add a GetMemLocation() helper to correctly select DEVICE vs HOST_NUMA locations for pod-comm VMM allocations and access descriptors.
Remove CheckPages()/move_pages() validation for fabric-exportable host allocations (keep zeroing), with an explanatory comment.
Extend CUDA-compat macro aliases/undefs to include hipMemLocation and hipMemLocationTypeHostNuma.
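The location selection described in the list above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the PR's actual code: `MemType`, `MemDevice`, `LocationType`, and `MemLocation` are hypothetical stand-ins for the TransferBench and HIP types (so the sketch compiles without HIP headers), `MEM_CPU` and `MEM_GPU` are assumed enumerator names, `GetClosestCpuNumaToGpu` is stubbed with a made-up topology, and the function signature is a guess. Only the id selection for `MEM_CPU_CLOSEST` is taken from the diff in this PR.

```cpp
#include <cassert>

// Hypothetical stand-ins. The real code would use TransferBench's own memory
// descriptor and HIP's hipMemLocation / hipMemLocationTypeHostNuma.
enum MemType { MEM_CPU, MEM_CPU_CLOSEST, MEM_GPU };  // MEM_CPU, MEM_GPU assumed
enum class LocationType { Device, HostNuma };

struct MemDevice   { MemType memType; int memIndex; };
struct MemLocation { LocationType type; int id; };

// Stub with a made-up topology: two GPUs share each NUMA node.
int GetClosestCpuNumaToGpu(int gpuIndex) { return gpuIndex / 2; }

MemLocation GetMemLocation(const MemDevice& memDevice) {
  MemLocation location{};

  // Assumption: every CPU memory type is placed on a host NUMA node,
  // everything else on a device.
  const bool isCpu = memDevice.memType == MEM_CPU ||
                     memDevice.memType == MEM_CPU_CLOSEST;
  location.type = isCpu ? LocationType::HostNuma : LocationType::Device;

  // Determine location id (mirrors the diff in this PR).
  if (memDevice.memType == MEM_CPU_CLOSEST) {
    location.id = GetClosestCpuNumaToGpu(memDevice.memIndex);
  } else {
    location.id = memDevice.memIndex;
  }
  return location;
}
```

Returning a single location value lets the allocation properties and the access descriptor be filled from the same source, which is one plausible reason for factoring this into a helper.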


+    // Determine location id
+    if (memDevice.memType == MEM_CPU_CLOSEST) {
+      location.id = GetClosestCpuNumaToGpu(memDevice.memIndex);
+    } else {
+      location.id = memDevice.memIndex;
+    }


revert pinned hostmem for pod comm

b75b28f

Copilot AI review requested due to automatic review settings May 8, 2026 20:50

AtlantaPepsi requested a review from a team as a code owner May 8, 2026 20:50

Copilot started reviewing on behalf of AtlantaPepsi May 8, 2026 20:51

Copilot AI reviewed May 8, 2026

Comment thread src/header/TransferBench.hpp

Comment on lines +1426 to +1431

    // Determine location id
    if (memDevice.memType == MEM_CPU_CLOSEST) {
      location.id = GetClosestCpuNumaToGpu(memDevice.memIndex);
    } else {
      location.id = memDevice.memIndex;
    }

AtlantaPepsi wants to merge 1 commit into ROCm:candidate from AtlantaPepsi:EGM
